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ABSTRACT 

Wc review current methods for building PSF-matching kernels for the purposes of image subtraction 
or coaddition. Such methods use a linear decomposition of the kernel on a series of basis functions. The 
correct choice of these basis functions is fundamental to the efficiency and effectiveness of the matching 
- the chosen bases should represent the underlying signal using a reasonably small number of shapes, 
and/or have a minimum number of user-adjustable tuning parameters. We examine methods whose 
bases comprise multiple Gauss-Hermitc polynomials, as well as a form free basis composed of delta- 
functions. Kernels derived from delta-functions are unsurprisingly shown to be more expressive; they 
are able to take more general shapes and perform better in situations where sum-of-Gaussian methods 
are known to fail. However, due to its many degrees of freedom (the maximum number allowed by the 
kernel size) this basis tends to overfit the problem, and yields noisy kernels having large variance. We 
introduce a new technique to regularize these delta-function kernel solutions, which bridges the gap 
between the generality of delta-function kernels, and the compactness of sum-of-Gaussian kernels. 
Through this rcgularization we are able to create general kernel solutions that represent the intrinsic 
shape of the PSF-matching kernel with only one degree of freedom, the strength of the rcgularization 
A. The role of A is effectively to exchange variance in the resulting difference image with variance 
in the kernel itself. We examine considerations in choosing the value of A, including statistical risk 
estimators and the ability of the solution to predict solutions for adjacent areas. Both of these suggest 
moderate strengths of A between 0.1 and 1.0, although this optimization is likely dataset dependent. 
This model allows for flexible representations of the convolution kernel that have significant predictive 
ability, and will prove useful in implementing robust image subtraction pipelines that must address 
hundreds to thousands of images per night. 

Subject headings: methods: data analysis, techniques: image processing, techniques: photometric 



1. INTRODUCTION 

Studies of variability in astronomy typically use im- 
age subtraction techniques in order to characterize the 
magnitude and type of the variability. This practice in- 
volves subtracting a prior-epoch (generally high signal- 
to-noise) template image from a recent science image; 
any flux remaining in their difference may be attributed 
to phenomena that have varied in the interim. This tech- 
nique is sensitive to both photometric and astromctric 
variability, and can uncover variability o f both point- 
sources (such as stars or supernovae; e.g. lUdalski et al.l 
[2008t iSako et all [200I) an d extended -sources (s uch as 
comets or light echoes; e.g. lNewman fc Restll2006[) . Suc- 
cessful application of this technique shows that it is sen- 
sitive to variability at the P oisson noise limit in a variety 
of astrophysical conditions (lAlard fc Lupton l l 19981 : lAlardI 
[20001 : IBramichI 120081 : IKctIus et al.l 1201(1 7 and in this re- 
gard may be considered optimal. 

There are several reasons for preferring such an ap- 
proach over catalog-based searches. First, many types 
of variability are found in confused regions of the sky, 
and it may be difficult to deblcnd the time-variable sig- 
nal from the non-temporally-variable surrounding area. 
This is particularly true for supernovae and active galac- 
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tic nuclei, which are typically blended with light from 
their host galaxies. However, such confusion is not lim- 
ited to stationary objects. Moving solar system bodies 
may serendipitously yield false brightness enhancements 
in the measurement of a background object if the im- 
pact parameter is small compared to the image's point 
spread function (PSF). For this reason, removal of non- 
variable objects is preferred before attempting to char- 
acterize variable sources in images. 

Image subtraction is also an efficient technique as 
the vast majority of pixels in an image do not 
contain signatures of astrophysical variability. Any 
pixel-level analysis of a difference image will, there- 
fore, be restricted to those sources that are tempo- 
rally variable (as opposed to analyzing all sources 
within an image) . While m any variant s of this tech- 
nique have been publ is hed (iTomanev fc~Crottsl [19961: 
Alard fc LuptM] [l998l : IBramichI 120081 lAlbrow et al.l 
2009fl. and many versions imple mented in auto- 
mated variability-detection pipelines (iBond et al. 2001; 
Rest et all [2 005": 'D arnlev et all [20071: [Miller et al..i2008 : 
Udalski et a l. 2008), there does remain room for improve- 
ment in the robustness of the image subtraction, and 
in the re duction of subtraction artifacts. We refer the 
reader to iWozniakI ([20081 ) for an in-depth summary on 
the practical application of these image subtraction tech- 
niques. 



1.1. Image Subtraction 
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In image subtraction we assume that we have two im- 
ages of the same portion of the sky, taken at different 
epochs, but in the same filter. We will call the image that 
contains the variability of interest the "science" image, 
and the template image to be subtracted the "reference" 
image. The images will, in general, be astrometrically 
misaligned, but this can be resolved by using sinc-based 
image registration methods that preserve the noise prop- 
erties of the original image. After astrometric alignment, 
a given astrophysical object will be represented in the 
reference image as a sub-array of pixels R{x, y) and in 
the science image as S{x,y), with the same span in x 
and y. Each image will, however, have a different point 
spread function (PSF), which is the spatial response of 
a point source due to the atmosphere, telescope optics, 
and instrumental signatures. PSF-matching of the im- 
ages is required before we can subtract one image from 
the other, and is the essence of the image subtraction 
technique. 

2. PSF-MATCHING 

We typically assume that the reference image is a 
high signal-to-noise (S/N) representation of the field, 
for example an image coadd made through a mosaic- 
ing process, or a single image taken on a night with 
particularly good seeing. A stan dard assumption (e.g. 
lAlard fc Luptonlll998l:IAlardll2000t) is that S{x,y) can be 
modeled as a convolution of R{x,y) by a single PSF- 
matching kernel K(u,v), with an additional noise com- 
ponent e(x, y); 



S{x, y) = {K (g) R){x, y) + e{x, y). 



(1) 



Our goal in this paper is to develop an effective method 
for determining K{u, v). 

2.1. Linear Modeling of K{u,v) 

As inputs to the PSF-matching technique, we assume 
images arc astrometrically registered, and background 
subtracted (while this latter constraint is not a neces- 
sity, it docs enable us to restrict our analysis here to the 
respective shapes of the PSFs). To proceed, we make 
the assumption that K{u,v) may be modeled as a lin- 
ear combination of basis functions Ki(u,v) , such that 



K{u,v) = J2i0.iKi{u,v) (|Alard fc LuptonI fl998l ) . The 
basis components do not have to be orthonormal, nor 
does the basis need to be complete (indeed, it may be 
overcomplcte). However, it is desirable to choose a shape 
set that compactly describes A'(w, u), such that the num- 
ber of required terms is small. 

By formulating the kernel decomposition as a linear 
expansion, we may recast Equation [T] as the vectorized 
equation 



S = Ca + e 



(2) 



where C is the matrix of functions Ci = {Ki R) eval- 
uated at each pixel. For any given kernel basis set, the 
goal is to find the coefficients associated with each Ki. 

We proceed using standard linear least squares analy- 
sis. We assume that the noise is uncorrelated and known; 
e is therefore the product of a diagonal matrix S^/^, 
which represents the square root of the known per-pixel 
variance, and a zero-mean unit-variance random vari- 
able Z . By reweighting by the inverse square root of E 



(which must exist as covariance matrices are positive def- 
inite and hence invertiblc) we obtain the modified equa- 
tion 



(3) 



which is just another linear model, now with the error 
term Z having the identity matrix for the covariance. 
This reduces to the weighted linear least squares equa- 
tion 



S^Ca + Z, 



with 



(4) 
(5) 



The normal equations for estimating a are: 

C'^S = C'^Ca (6) 



This may be cast in the familiar form of 6 = Ma with 

(7) 



In discrete pixel coordinates, this corresponds to 

Ct{x,y)S{x,y) 



M,, 



x,y 



a^{x,y) 

C^{x,y)Cj{x,y) 
(T^(x,y) 



(8) 



where a'^{x,y) represents the known variance per pixel. 
The creation of the matrices Mij and bi therefore requires 
a convolution of the reference image with each basis ker- 
nel. 

The least-squares estimate for a is a = {C^ C)^^ S . 
A difference image is then constructed as D{x, y) ~ 
S{x, y) — Cd{x, y). Because the estimate of a is explicitly 
dependent on both S{x,y) and R{x,y), the residuals in 
the difference image may not necessarily follow a normal 
A/'(0, 1) distributiorQ, with cr^ 7^ 1 due to this covariance. 
The residuals should however have a flat power spectral 
density. 

2.2. Invertability of C^Y~^C 

When a large set of basis functions is used, the matrix 
AI = C^Tj~^C may be ill-conditioned or even singular. 
This can be quantified by the "condition number" of M , 
which we define as the ratio of the largest to the smallest 
eigenvalues. When the condition number is large, inver- 
sion of M will be numerically unstable or infeasible. 

A common approach when trying to invert an ill- 
conditioned matrix is to compute instead a pseudo- 
inverse, or an approximation to one in which eigenval- 
ues that arc numerically small are zeroed out. As M is 
symmetric, we can decompose it as M = VDV^ with 
V an orthogonal matrix and D = diag(di, . . . , dn) with 
eigenvalues di > d2 > ... > fin > 0. We define Di = 

* We use the mean and variance, not mean and standard devia- 
tion, as the two parameters of Normal distributions. 
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diag((ii, . . . , di, 0, . . . , 0) to be a truncation of D where 
di/di+i becomes too large. Then, we define the pseudo- 
inverse of Di as D] = diag(l/(ii, . . . , l/d^, 0, . . . , 0). Note 
this allows for the definition of a pseudo-inverse of M as 
Aff = VDIv'^. Analogous to A, define to be the 
same as the matrix V in the first i columns, and zero 
elsewhere. Typically this truncation threshold is defined 
by the machine precision of the computation (e.g. for 
double-valued calculations, 1/dmin ~ 5 x 10^^). How- 
ever, significantly larger limits for dmm may be used 
to avoid underconstrained parameters, such as in Sec- 
tion EXH 

3. SUM-OF-GAUSSIAN BASES 

The original PSF- matc hing bases proposed by 
lAlard fc Luptol] (|l998l ) and lAlardI (|2000l ) (referred to 
here as "Alard-Lupton" or AL bases) used a sum of 
multiple Gaussians, each modified by a 2-dimensional 
polynomial: 



(9) 



where the index i runs over all permutations of n,p,q. 
This basis set effectively uses n = 1 . . . N Gaussian com- 
ponents, each with width a„, and eac h modified by a set 
of Gauss-Hermite polynomials (e.g. IWnschd I2000D ex- 
panded out to order 0„ (0 < p -I- g < 0„). The total 
number of basis functions in the set is ^ (0„ -f 1) x 
(0„+2)/2. 

The number N and width ct„ of the Gaussians, as well 
as spatial order of the polynomials 0„, are configurable 
but are not fitted parameters in the linear least-squares 
minimization. Therefore these are tuning parameters of 
the model. Typically, a-priori information such as the 
widths o f the image PSFs is used to choose these val- 
ues (e.g. [Israel et al.ll2007f). In a representative imple- 
mentation ( Smith et al.ll2002D , three Gaussians are used, 
with the narrowest Gaussian expanded out to order 6, 
the middle to order 4, and the widest to order 2. This 
leads to a total of 49 basis functions used in the kernel 
expansion. 

The practical application of this algorithm has been 
very successful, and it has been used by various 
time- domain surveys such as MACHO ( Alcock et al.l 
[19991) ■ OGLE (lWozniakl l2000 HUdalski et al.ll2008D MOA 
(IBond et all 120011) ■ SuperMACHO (I Smith et al.l [200l 



Rest et al l I2005D . the Dee p Lens Survey (jBecker et al.l 
200J), ESSENCE (Miknait lseFall 120071) . the SDSS-II 



Supernova Survey (iSako et al.ll2008[) . and most recently 
analysis of commissi oning data from Pan-STARRS 
(jBotticeha et al.|[20lol ). 

The top row of Figured] shows an instance of successful 
PSF matching using this sum-of-Gaussians basis. The 
first column represents a high signal-to-noise image of a 
star R(x, y) generated from an image coaddition process 
applied to data from the Canada-France-Hawaii Tele- 
scope (CFHT). The second column shows this same star, 
aligned with the template image to sub-pixel accuracy, 
in a single science image S{x,y). The star is obviously 
asymmetric, potentially due to optical distortions such as 
focus or astigmatism, or due to tracking problems during 
acquisition of the image. The PSF-matching kernel thus 
will need to take the symmetric R{x, y) and elongate it 
along a vector oriented approximately 135 deg from hor- 



izontal. The first row, third column shows the best-fit 
PSF-matching kernel using = 3 Gaussians with (T„ 
= [0.75, 1.5, 3.0] pixels, and each modified by Hermite 
polynomials of order 0„ = [4, 3, 2], respectively. The 
total number of terms in the expansion is 31. The first 
row, right column shows the resulting difference image 
D{x,y). The subtraction is obviously very good, with 
the remaining pixels described by a A/'(0.01, 1.01) distri- 
bution. 

3.1. Limitations of the Model 

The intrinsic symmetries of Hermite polynomials (sym- 
metric for even order, anti-symmetric for odd order) 
means that the Gauss-Hermite bases possess a high de- 
gree of symmetry about the central pixel. This makes it 
difficult to concentrate the kernel power off-center when 
using an incomplete basis expansion. Such functional- 
ity is necessary when the flux needs to be redistributed 
on the scale of the kernel size, such as when there are 
astrometric misalignments. While it is possible to com- 
pensate for misalignment using kernels derived from this 
basis, this requires concentrating the kernel strength in 
the high-order terms. There are practical limitations to 
the efficacy of this including the scale and orientation of 
the required shift, and the number of basis terms used. 

As a concrete example, the second row in Figure [T] 
shows the best-fit kernel derived when there is a 3-pixel 
shift in both the x and y directions. The kernel needs 
to have power in the first quadrant (upper right) at the 
scale of 3 pixels. The image of the kernel (third column) 
shows that while it is obviously able to do so, the match- 
ing suffers in the third quadrant, as the difference image 
shows obvious residuals. These pixels result in an unac- 
ceptable A/'(0.01, 1.44) distribution; recall we were able 
to yield = 1.01 for well-registered images (top row). 

Another limitation of the model is that there are a va- 
riety of tuning parameters. This includes the number of 
Gaussians in the basis, their widths, and their spatial 
orders. These parameters are typically chosen using a 
set of heuristics. If there is a mismatch compared to the 
true underlying kernel, this process will fail. The third 
row of Figure [T] shows PSF-matching results when the 
basis Gaussians are too big and are unable to reproduce 
the small-scale differences in the PSFs. This yields ob- 
vious residuals in the difference image, which follow a 
A/'(0.02, 3.03) distribution. The fourth row of Figure [U 
shows results when the Gauss-Hermite polynomials are 
not allowed to vary to high enough order, also yielding 
unacceptable 7V(0.02, 2.86) residuals in the difference im- 
age. 

Clearly the results of this process are sensitive to the 
choice of several tuning parameters, which makes this 
difficult to implement robustly. In a statistical sense, 
selection of tuning parameters (which includes selecting 
the number of basis functions used) usually has a much 
larger effect on performance than does the choice of ba- 
sis functions. A process that results in a reduction in 
the number of kernel tuning parameters, while maintain- 
ing the quality of the difference images, would greatly 
improve the effectiveness of this method. 

4. DELTA-FUNCTION BASES 

The most general technique for modeling K{u, v) is 
to use a "shape free" basis, which consists of a delta 
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Figure 1. Difference imaging results when using a sum-of-Gaussian basis. The first column shows the reference image to be convolved 
R{x,y), the second shows the science image S{x,y) the reference is matched to, the third column shows the best-fit 19 X 19 pixel PSF 
matching kernel K{u,v), and the fourth column shows the resulting difference image D(x,y). Row 1: Results when using a basis set with 
Cn = [0.75, 1.5, 3.0] pixels. On = [4, 3, 2]. Row 2: Results when the images are mis— registered by 3 pixels in both coordinates, requiring 
significant off-center power in the kernel. Row 3: Results when the basis Gaussians are too large compared to the actual PSF-matching 
kernel ((t„ = [3.0, 5.0] pixels, On = [3, 2]). Row 4- Results when the polynomial expansion is not carried to high enough order (a„ = 
[0.75, 1.5, 3.0] pixels, 0„ = [1, 1, 1]). 



function at each kernel pixel index: Kij{u,v) = S{u — 
i)d{v — j). A kernel of size 19 x 19 will then have 361 
orthonormal, single-pixel bases. In this situation there 
are no tuning parameters, which is an obvious benefit. 
However, in any choice of basis there is a trade off be- 
tween flexibility in the forms the fitted function can take, 
and variability in the resulting fit (the so-called "bias- 
variance" trade off). The delta-function basis provides 
complete flexibility, and as such can account for features 
such as arbitrary off-center power requi red to compen - 
sate for astrometric misregistration (e.g. lBramichll2008f ). 
But to avoid gross overfitting, that flexibility needs to be 
tempered to keep the variance in check. 

Figure [2] shows the results of PSF-matching using such 
a basis, using the same objects as in Figure [TJ The top 
row demonstrates the results for exactly aligned images, 
while the bottom row demonstrates the results for images 
misaligned by 3 pixels in both x and y. The difference 
images are qualitatively similar. However, the best-fit 
solutions obviously yield large variations within the ker- 
nels themselves, and do not match expectations of what 
the actual kernel should look like. The reason for this 
can be found in the distribution of pixels residuals in 
the difference image. Both images follow a A/'(0.01, 0.79) 



distribution. This indicates that the residuals have lower 
variance than Gaussian statistics would suggest. Indeed, 
in Figure [2] column 4 the residuals appear smoother than 
random noise. This is impossible unless we have overes- 
timated the variance in our images, or unless the kernels 
themselves are removing some fraction of the noise. 

The large numbers of basis shapes (361 degrees of free- 
dom vs. 31 for the sum-of-Gaussians) makes it highly 
likely that we are over-fitting the problem. The kernel 
thus has the ability to match both the underlying signal 
and the associated noise in the two images. So while this 
technique is optimal for matching pixels in two images - 
where those pixels are a combination of signal and noise 
- it is not necessarily optimal for uncovering the true 
PSF-matching kernel. 

A consequence of this is that the PSF-matching kernel 
derived for any given object may not be directly applied 
to neighboring objects, since the solution is significantly 
driven by the local noise properties. High variance es- 
timators are particularly poor as inputs to interpolation 
routines, or to a spatial model of the kernel K{u, v, x, y), 
that find the matching kernel at all locations as a func- 
tion of the fitted kernels at particular locations. Below, 
we explore how introducing a certain amount of bias into 
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this estimator can improve its performance. 



normal equations (Equation [51) with strength A 



5. DELTA-FUNCTION BASES WITH REGULARIZATION 

The delta-function basis can flexibly fit a kernel of any 
form, but as we have shown, this fiexibility is both its 
strength and weakness. As is, the method significantly 
overfits, absorbing substantial noise fiuctuations into the 
fit and thus giving estimated kernels with excessive vari- 
ance. A solution is to introduce some amount of bias into 
the fit to reduce the solution variance by a much larger 
factor (if "bias" sounds pejorative, note that this is just a 
kind of smoothing) . When fitting a smooth function such 
as K{u, V, X, y), we prefer fitted kernels for which nearby 
solutions do not vary too greatly. This bias will enable 
such a fit with vastly reduced mean-squared error. 

Among the various approaches to dealing with overfit- 
ting, the most common are thr ough linear regula rization 
techniques (e.g. Section 18.5: iPress et al.|[l992D . Using 
these, we may penalize undesirable features of the fit, 
usually by adding a penalty term to our optimization 
criterion. For instance, when fitting a smooth function, 
we want to penalize fits / that are too rough or irregular. 
One way to do this is to add to the least squares objective 
a term penalizing the second derivative, A ■ J \ f" {x)\'^dx. 
Here, the scaling factor A is a tuning parameter that de- 
termines the balance between fidelity to the data and 
the desired smoothness. In the case of kernel matching, 
we may extend this idea with a two dimensional penalty 
that approximates A • / / \y f {x , y)\'^ dxdy . 

The one-dimensional second derivative of a function / 
around pixel x may be approximated using the central 
finite difference /"(x) « f{x-l)-2f{x) + f {x+l). Since 
the delta-function bases have unit height and no intrinsic 
shape, regularizing the coefficients is equivalent to reg- 
ularizing the shape of the resulting kernel (care must be 
taken to apply the regularization penalty to only those 
pixels that are associated spatially). In matrix terms, 
this one-dimensional regularization may be represented 
by i?ia, with 



b = Mxa. 



XH)a 



(12) 



Vo 



-2 1 
1-2 1 



(10) 



1-2 1/ 



which is of dimension (m — 2) x m, where m here is the 
total number of pixels in the kernel 0. A generalization 
of this to two dimensions results in a 5-point stencil that 
sums the local derivative along both axes, f"{x,y) « 
f{x^l,y) + fix + l,y) + f{x,y-l) + fix,y+l)~Afix,y), 
with an associated matrix i?2. 

The finite calculation of this penalty is implemented 
through the matrix equations 



l-R2a|- 



(2 



(11) 



where a represents the amplitude of each delta function, 
and i?2 encapsulates the coefficients that approximate 
the local 2"** derivative of the resulting kernel. We define 
the matrix H = Rj R2, which makes the second deriva- 
tive penalty aJ Ha. This matrix is used to regularize the 

^ The absolute value of the kernel's border pixels may also be 
penalized through the addition of a row at both the top and bottom 
of _Ri. 



Note the similarity to Equation 21 with the only differ- 
ence being M\ = M+XH. Here A represents the strength 
of the regularization penalty, and is the sole tuning pa- 
rameter in this model. 

Figure |3l shows results for the same set of objects dis- 
played in Figure □ and Figure 12 but using regularization 
of the delta-function basis set. The top row shows the 
results for aligned images, and A = 1. Note that the 
kernel looks very much as anticipated, being compact 
and having a shape aligned approximately 135 deg from 
horizontal. Residuals in the difference image follow a 
A/'(0.01, 0.94) distribution. The second row shows the re- 
sults when the images are misaligned by 3 pixels in x and 
y. The kernel merely appears shifted by the same amount 
compared to the aligned images, and the difference im- 
age follows a quantitatively similar A/'(0.01, 0.96) distri- 
bution. This effectively demonstrates that this method 
can reproduce kernels with off-center power. The third 
row shows the results with A = 0.01; the shape of the 
PSF-matching component of the kernel is just barely dis- 
cernible above its noise, suggesting the regularization is 
too weak. The difference image is, however, acceptable 
(A/'(0.01, 0.81)). The fourth row shows the results with 
A = 100. The kernel is far smoother than in previous 
runs. However, this appears to be at the expense of resid- 
uals in the difference image, which follow a A/'(0.01, 1.35) 
distribution. This suggests that too much weight has 
been given the smoothness of the kernel compared to the 
residuals in the difference image, indicating that the reg- 
ularization is too strong. The general trend is that with 
increasing lambda, the variance in the difference image 
increases. The noise properties of the difference image 
evolve from being too smooth, to approximately white in 
spectrum, to having residual features at a similar scale 
as the kernel. 

Overall, this technique appears very effective. We are 
able to create general, compact kernels that represent the 
underlying shape of the PSF-matching kernel with only 
one tuning parameter, the strength of the regularization 
A. The role of A is effectively to exchange variance in the 
resulting difference image with variance in the kernel it- 
self. By increasing the value of A, we are able to smooth 
the kernel while increasing the variance in the difference 
image. We explore various methods to establish the op- 
timal value of A below. 

5.1. Choice of Tuning Parameter A 

Choosing a good tuning parameter is essential for good 
performance of a regularization method. If A is too high, 
the fit will be too smooth (high bias, low variance); if A is 
too low, the fit will be too rough (low bias, high variance). 
The goal of data-driven methods for choosing tuning pa- 
rameters is to find the sweet spot in the bias-variance 
trade off. While choosing a good value for A is a hard 
statistical problem, there are a variety of methods that 
have proven successful in practice. These methods con- 
struct a statistical estimate of mean-squared error and 
choose A to minimize it. For instance in cross-validation 
(reviewed in lKohavl.l995i) . the data set is broken into 
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Figure 2. Difference imaging results when using a delta-function basis. Columns are the same as in Figure [T] Row 1: Results when 
using an unregularized delta-function basis. Row 2: Results when the images are mis-registered by 3 pixels in both coordinates. 










Figure 3. Difference imaging results when using a regularized delta-function basis. Columns are the same as in Figure [T] Row 1: 
Results when using a regularized delta-function basis with A = 1.0. Row 2: Results when the images are mis-registered by 3 pixels in 
both coordinates, A = 1.0. Row 3: Results using "weak" regularization of the kernel, with A = 0.01. Row 4- Ilesults using "strong" 
regularization of the kernel, with A = 100. 



pieces, and each piece is left out in turn during the fit. 
The (prediction) mean-squared error is derived from the 
average squared error of the fits in predicting the part 
of the data that was left out. Anothe r approach, called 
empirical risk estimation (ISteinllTOSlI ). uses the data it- 
self to compute an (unbiased) estimate of original fit's 
mean-squared error and chooses A to minimize it. The 
theoretical justification for these methods is that, when 



properly done and with sufficiently large data sets, the 
chosen A is close to the value that minimizes the corre- 
sponding mean-squared error function. 

A second tuning consideration is that frequently a set 
of fitted kernels will be used to constrain a spatial model 
K(u, V, X, y) that will be applied to aZZ pixels in an image. 
Therefore wc must give a large weight to our ability to 
interpolate between the ensemble of kernel realizations 
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used to constrain K{u,v,x,y). One metric for this is 
to examine the predictive power of a kernel derived from 
one object, and applied to a neighboring object. At small 
separations, the quality of each difference image should 
be similar, indicating that the initial solution was not 
significantly driven by the local noise properties. 

We explore the practical application of these 
ideas below using several sets of CCD images from 
the Canada-France-Hawaii Telescope plus Megacam 
imager, calibrated using the ELIXIR pipeline of 
IMagnier fc Cuillandrd (j2004D . The template image used 
is the median of several images into a single high S/N 
representation of the field. The variance per pixel is de- 
termined from the image pixel values divided by the gain. 

5.1.1. Empirical Risk Estimation 

We first construct a loss function that represents the 
sum of squared differences between the true (unknown) 
kernel coefficients a and a\ , which is our estimate of the 
kernel coefficients when the tuning parameter is set to 
the value A: 

L{a,ax) = iax - a^iax - a). (13) 

The expectation value of L(a, ax) is the statistical risk we 
will minimize through our choice of A0. When M is well- 
conditioned, we can construct an unbiased estimator of 
the true risk {L{a,ax)) as (Section 2. iSteinlflQSlf ) 

^(A)-||aA||^+ (14) 
2 (trace(MA) ~ ajM-'^b) . 

We note that this estimator of risk does not require 
tuning parameters. If we let A be the minimizer of i?, 
then we choose as the estimate of a. 

For the circumstance that M is ill-conditioned, we 
present an adjustment to R from Eauation ll4l Following 
the notation from Scction [2?2l for any i define 11 = ViV^ . 
This corresponds to 11 being a projection matrix onto the 
space of the eigenvectors of M that correspond to its i 
largest eigenvalues. Note that i is now an additional tun- 
ing parameter, corresponding to choice of condition num- 
ber (denoted by symbol A) for matrix M (Section 12. 2p . 
A biased estimate of the statistical risk is then: 

R{\)^\\Iiax\\l+ (15) 

2 (trace(nAfA) - alu'^h) . 

While introducing bias into the estimator of statistical 
risk seems bad, it can be necessary in situations where M 
is ill-conditioned. Small eigenvalues of M corresponds to 
there being very little information along the associated 
eigenvectors. By zeroing out these eigenvalues we are 
effectively saying we cannot reliably estimate with this 
little amount of information. Hence, we concentrate on 
getting the estimation correct on the eigenvectors with 
larger eigenvalues. 

For each object detected in the CFHT images, and for 
given values of condition number A ranging from 4 < 
log(A) < 6, we evaluate i?(A) at values of —2 < log(A) < 
2. Figure m shows a typical outcome of this analysis for 

® It should be noted that other risk estimators may be con- 
structed, e.g. ones that maximize the quality of the full difference 
image. 



a single object. Along the y-axis we show the associated 
value of the conditioning parameter A, and along the x- 
axis the value of A at which R{X) is evaluated. The solid 
line shows the minimum value of R{\) for each A. 

We note that as we decrease the acceptable matrix 
condition number, thereby truncating more eigenvalues 
from the matrix pseudoinverse, the optimum value of A 
increases. For matrices with effectively no conditioning 
(large A), the optimal value of A is near A = 0.5. This is 
in fact the global minimum of the risk. A similar result 
is obtained by looking at all objects within an image and 
summing their cumulative risk surfaces. We regard A = 
0.5 as the value preferred by the empirical risk estimation 
technique, with a range of nearly-equivalent risk between 
0.3 < A < 1.0. 

5.1.2. Predictive Ability 

In most PSF-matching implementations, several dozen 
objects across a pair of registered images are used to 
create individual K(u,v); ideally these should evenly 
sample the spatial extent of the images. Due to spa- 
tial variation in the PSFs of the images, caused by op- 
tical aberrations or bulk atmospheric effects, the sin- 
gle kernel that PSF-matches all objects in an image 
must itself vary spatially. In this case each of the ker- 
nels K{u,v) are used to build spatially varying PSF- 
matching kernel K(u,v,x,y). This is typically imple- 
mented as spatial variation on the kernel coefficients 
K{u,v,x,y) = '^^ai{x,y)Ki{u,v). Therefore an addi- 
tional consideration in the choice of A is the ability to 
build spatial models for the coefficients a{x,y). 

To quantify this, we examine the predictive ability of 
the kernel solution ax- In all CFHT images, we identify 
object pairs separated by more than 5 pixels but less than 
50, a range of separations where we expect the intrinsic 
spatial variation of the underlying kernel to be minimal. 
The kernel derived for each object in a pair is applied to 
its complement, and the quality of each difference image 
assessed. For components A and B of each object pair, 
this yields difference image Daa which is the difference 
image of object A with kernel A, Dab which is the dif- 
ference image of object A with kernel B, and analogous 
images Dba and Dbb- We assess the quality of each 
difference image using the width of the pixel distribution 
normalized by the noise, defined as e.g. aAA, within the 
central 7x7 pixels of the difference image. While we 
don't expect this distribution to have a width of exactly 
1.0 due to covariance between the solution and the input 
images, we do desire that the quality of Dab and Dba 
should not be significantly worse than that of Daa and 
Dbb- 

We aggregate the "even" statistics ctaa and aBB into 
distribution E^, and the "odd" statistics {(tab, o^ba) 
into Tjq- We further examine the distribution of Eq_^. 
which is created from all measurements of — cr^^ and 
<JgA ~ '^BB- This statistic reflects the deterioration in 
an object's difference image when using a counterpart's 
kernel, compared to the optimal kernel derived for that 
object. 

We plot the distributions of these values in Figure [SJ 
The top panel provides the median values of these dis- 
tributions for the sum-of-Gaussian (AL) basis (left), 
for the unregularized delta function basis (A = 0; cen- 
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Regularization strength A 

Figure 4. Values of the empirical risk R{X), as defined in Equation 1141 for different values of the matrix conditioning parameter A, and 
the regularization strength A. At all A, we determine the minimum values of R{X), which are connected by the solid black line. The dotted 
vertical lino represents the fiducial value of A = 1. The global minimum of i?.(A) is realized with minimal matrix conditioning, and at a 
value of A = 0.5. 



ter), and for delta-function regularization strengths of 
—2 < log(A) < 2 (right). The bottom panel plots the 
effective standard deviation of the distribution, defined 
as 74% of the interquartile range. 

The lowest median residual variance E^; comes from 
difference images made using an unregularized A = 
basis, the reasons for which we have examined in detail in 
Section m However, as expected the predictive ability of 
this basis is by far the worst, having the highest median 
Eq_^, as well as large variance within this distribution. 
As we ramp up the regularization strength, the predictive 
ability of the kernels increases (low but at the 

expense of the quality of the difference image itself (large 
Eis). 

To find an acceptable medium between these two con- 
siderations, we will use the results from the sum-of- 
Gaussian (AL) basis as a benchmark, since it has been 
shown to produce effective spatial models (Section [3]). 
For the AL basis, the median values of E^;, Eq, and 
Eq_£; are 0.99, 1.14, and 0.28, respectively. Similar 
results are obtained with delta-function regularization 
strengths of A « 0.2, 0.7, and 0.2. For AL the (Tmedian 
values of E^, Eq, and ^q-e ^^"^ 0.14, 0.33, and 0.74, 
respectively. These are matched (or bested) in the reg- 



ularized basis for A < 0.2, A = 0.2, and 0.2 < A < 6, 
respectively. 

In summary, using delta-function regularization 
strengths of A w 0.2, we are able to achieve difference im- 
ages with a similar quality to those yielded by the sum- 
of-Gaussian AL basis (using E^ as our metric). These 
models have similar predictive ability when applied to 
neighboring objects (quantified using Eq and Eq_^), 
making them useful for full-image spatial modeling. Fi- 
nally, they are seen to be generally applicable, having 
a small variance in the above statistics when evaluated 
over several hundreds of object pairs. 

6. CONCLUSIONS 

We've examined here the choice of basis set on the 
quality of PSF-matching kernels and their resulting dif- 
ference images. These include the traditional sum-of- 
Gaussian ( "Alard-Lupton" ) basis and a digital basis 
based upon delta-functions. We find that while the 
delta-function kernels are the most expressive, they are 
also the least compact in terms of localization of power 
within the kernel. Having one basis component per pixel 
in the kernel, they tend to overfit the data and are more 
sensitive to the noise in the images instead of the intrinsic 
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Figure 5. Median statistics assessing the predictive ability of different kernel bases. The top panel shows the median values of statistics 
Eg (rod circle and solid line), So (blue square and dashed line), and Sq^^ (green triangle and dotted line) for "Alard-Lupton" (AL) 
bases, for delta-function bases with A = 0, and then for a range of —2 < log(A) < 2. All statistics are defined in Section 15.1.21 The bottom 
panel shows the standard deviation of the distribution, defined as 74% of the interquartile range. 



PSF-matching signal. 

We introduce a new technique of linear regularization 
to impose smoothness on these delta-function kernels, 
at the expense of slightly higher noise in the difference 
images. These regularized shapes are shown to be flex- 
ible, and yield solutions with sufficient predictive power 
to prove useful for spatial interpolation. We outline two 
methods to determine the strength of this regularization 
that minimize the statistical risk of the kernel estimate, 
and that examine the predictive ability of the derived 
kernels. Both methods suggest values of A that are be- 
tween 0.1 and 1.0. 

Given the large range of image qualities used in image 
subtraction pipelines compared to the small number of 
images used in the analysis here, we caution that these 
estimates may not be applicable under all conditions and 
should really be estimated on a dataset-by-dataset ba- 
sis. The optimal value of A will be a function of the 
S/N in the template and science images, which should 
affect the level of kernel smoothing needed, and of the 
respective seeings in the input images, which may im- 
pact the suitability of our finite-difference smoothness 
approximation. 

While this implementation appears successful and 



practical, there are various improvements we might con- 
sider in our regularization efforts. This includes changing 
the scale over which the regularization stencil is calcu- 
lated based upon the seeing in the images; currently this 
is being done in pixel-based coordinates, and not ad- 
justed depending on the full- width at half-maximum of 
the input PSFs. Wc also plan to examine additional met- 
rics to determine the optimal value of A, including the 
power spectrum of noise in the resulting difference image, 
which should be flat. Ultimately, the overall quality of 
the entire difference image is the optimal metric to use in 
assessing choice of basis; we will be expanding our anal- 
ysis to include full-image metrics and spatial modeling 
of the kernel. 

Finally, the wealth of statistical techniques to effi- 
ciently choose basis shapes has not been exhausted. 
Other potential methods include the use of overcom- 
plete bases, where the choice of the correct subset of 
comp onents to use is made though through basis pur- 
suit (jChen et al.l I1998D , as well as the process of "ba- 
sis shrinkage" through the use o f multi-scale wavelets 
(|Donoho fc Johnstonelll994ll995f ). In all considerations, 
it is an advantage to yield solutions that, as an ensemble, 
have a low dimensionality so that spatial modeling is ef- 
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ficient and spatial degrees of freedom are not being used 
to compensate for an inefficient choice of basis. However, 
for any given basis set the choice of regularization (none 
at ah or using a fixed set of functions) is hkely to be the 
proper place for optimization. 

This material is based, in part, upon work supported 
by the National Science Foundation under Grant Number 
AST-G7G9394. 
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